TST: measure and log individual test durations in CI #677
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff            @@
##           master     #677       +/-   ##
===========================================
- Coverage   62.28%   45.19%   -17.09%
===========================================
  Files          11       87       +76
  Lines         928    16835    +15907
===========================================
+ Hits          578     7608     +7030
- Misses        350     9227     +8877

View full report in Codecov by Sentry.
I completely missed that there were two separate calls to pytest, and of course I'm measuring the wrong one. Will update when back near a keyboard.
I'm just noticing now that
I see the longest-running tests take ~30 and ~15 min respectively. This is extremely long; CI jobs can usually complete well under a minute, so I'm puzzled.
They do genuinely take this long, and I'm aware it's far from ideal. I've been wanting to improve this for a while, and my approach has been to gradually speed up the functions themselves. This has had some success: the tests used to run for almost 3 hours. The tests spend the vast majority of their time in two functions, fm.firstguess and fm.mcmc_negfc_sampling. These are important functions, but there might be a way to make the tests less intensive while still simulating their real use.

As for the Python versions, I'm open to ideas. Testing each version has actually been helpful, since it often reveals dependency issues and deprecated code; just installing the packages and importing VIP doesn't always trigger an error. Do test suites usually target a single Python version? One option is to run the tests on 3.13, and have the earlier versions just install and import VIP to see if anything pops up.

As for what the tests are doing, I believe Valentin has some small example data that he provides to most of the functions several times to test each use case.
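One possible direction for the "less intensive while still simulating real use" idea is to drive the expensive sampler with a deliberately tiny budget in CI. The sketch below is only an illustration, not part of this PR: the keyword names (`nwalkers`, `niteration_limit`) and the `example_*` fixtures are assumptions that would need to be checked against the actual `fm.mcmc_negfc_sampling` signature and Valentin's example data.

```python
# test_negfc_budget.py -- hypothetical sketch, not part of this PR.
# Exercise the same code path as a full run, but with a tiny sampling
# budget in CI. The keyword names below are assumed, not VIP's verified API.
import os

from vip_hci import fm

# Shrink the budget when running under CI (GitHub Actions sets CI=true).
TINY = os.environ.get("CI") == "true"
MCMC_KWARGS = (
    {"nwalkers": 20, "niteration_limit": 50}      # assumed kwargs, CI budget
    if TINY
    else {"nwalkers": 100, "niteration_limit": 500}  # fuller local budget
)

def test_mcmc_negfc_sampling_smoke(example_cube, example_angles, example_psf):
    # example_* are hypothetical fixtures standing in for the small
    # example data mentioned above.
    result = fm.mcmc_negfc_sampling(
        example_cube, example_angles, example_psf, **MCMC_KWARGS
    )
    assert result is not None  # smoke check: the sampler ran to completion
```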
Yes, running most of your tests against multiple Python versions is extremely valuable. What I'm proposing is to exclude a selection of the suite from the normal jobs and run that same subset against a single Python version, to save resources (and possibly some wall time as well).
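One standard way to implement such a split (a sketch, not from this PR; it follows the marker pattern from pytest's own documentation) is to tag the expensive tests with a `slow` marker and deselect them by default:

```python
# conftest.py -- sketch of a "slow" marker that is skipped unless
# explicitly requested (adapted from the pytest documentation).
import pytest

def pytest_configure(config):
    # Register the marker so pytest doesn't warn about it.
    config.addinivalue_line("markers", "slow: marks tests as slow")

def pytest_addoption(parser):
    parser.addoption(
        "--runslow", action="store_true", default=False,
        help="run tests marked as slow",
    )

def pytest_collection_modifyitems(config, items):
    if config.getoption("--runslow"):
        return
    skip_slow = pytest.mark.skip(reason="need --runslow option to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)
```

The full Python-version matrix would then run plain `pytest` (slow tests skipped), while a single extra job on one version runs `pytest --runslow`.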
I'm okay with this; we'll see what Valentin thinks.
Other than that, regarding performance, have you considered rewriting performance-critical loops in a low-level language (C, C++, Rust, or even Cython)?
@neutrinoceros thanks for all of this! Just going to piggy-back off this most recent PR and ask: is there an order in which to merge these PRs? Or should we just accept the ones that are ready and leave the drafts?
Yes, there's an order in which my PRs are intended to be reviewed and merged. I'm very careful to only undraft the ones that have no dependencies, so please prioritize everything explicitly marked as ready for review (that is, not drafted). If more than one is ready at any given point, they should be independent, and the order doesn't matter.
Hi @neutrinoceros, thanks for raising this. As @IainHammond explained, we have put some effort into reducing the tests' computing time, although we have started to hit a barrier in trying to (i) keep them meaningful and (ii) cover as many parts of the code at the same time as possible (to minimize the overhead of partly re-running the same preliminary code in different tests).

That being said, I dived into the longest remaining tests, and I think there is a way to reduce the longest one by ~40% without reducing the coverage or the meaningfulness, as explained below. Thanks, by the way, for highlighting these with the proposed change (which I think should also be fine to keep in the PR, unless you see a reason not to?).

In essence, for now the

To further reduce the length of the 2 longest tests (
That's okay with me! I initially used a pytest plugin to measure durations, but I didn't want to add another dependency long term. Now that I've realized pytest has this feature natively, I don't mind if you guys want to merge the change as is.
I'll give it a spin!
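For reference, the native feature referred to here is pytest's `--durations` reporting: `pytest --durations=20` prints the 20 slowest setup/call/teardown phases, and `--durations=0` prints all of them. If finer control were ever needed, a tiny conftest.py hook could log every test's call-phase duration without adding a dependency; a minimal sketch, not part of this PR:

```python
# conftest.py -- minimal sketch: print each test's call-phase duration
# using a standard pytest hook.
def pytest_runtest_logreport(report):
    # The hook fires for the setup, call and teardown phases; only the
    # call phase reflects the test body itself.
    if report.when == "call":
        print(f"{report.nodeid}: {report.duration:.2f}s")
```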
I get the noble intention, but I find this approach quite counter-productive long term. I'll try to explain my perspective here (and to keep it brief :)):

Anyway, these are very general considerations; I'm not making a judgment on VIP specifically, nor have I studied it deeply enough to see where the low-hanging fruit would be.
force-pushed from a100fe5 to a60148b
Thanks for your explanations @neutrinoceros. I agree with the idea of testing as little as possible in each chunk. I think the test suite grew somewhat organically in VIP; in some cases we saw the opportunity to incorporate very short bits of code leveraging the outputs of another tested algorithm, which involved less hassle than designing new tests.
This is a trial run, not intended for merging (at least not yet), as I'm curious to see exactly which test(s) take so long that CI requires about 2 hours.